Iris Flowers Case Study

Multi-class classification. Dataset description: https://archive.ics.uci.edu/ml/datasets/Iris

Load Libraries


In [59]:
library(caret)

Load Data


In [60]:
# attach the iris dataset to the environment
data(iris)
# rename the dataset
dataset <- iris

Split Out Validation Dataset

We need to know whether the model we create is any good. We will hold back some data that the algorithms will not get to see and use it to get a second, independent idea of how accurate the best model might actually be. We will split the loaded dataset into two: 80% will be used to train our models and 20% will be held back as a validation dataset.


In [61]:
# create a list of 80% of the rows in the original dataset we can use for training
validation_index <- createDataPartition(dataset$Species, p=0.80, list=FALSE)
# select 20% of the data for validation
validation <- dataset[-validation_index,]
# use the remaining 80% of the data for training and testing the models
dataset <- dataset[validation_index,]

You now have training data in the dataset variable and a validation set we will use later in the validation variable. Note that we replaced our dataset variable with the 80% sample of the dataset. This keeps the rest of the code simpler and more readable.

Summarize Dataset

Dimensions of the dataset: how many rows and columns do we have?


In [62]:
dim(dataset)


  1. 120
  2. 5

List the types of each attribute. This gives us an idea of how to better summarize the data we have and of the transforms we might need to apply to prepare the data before modeling it.


In [63]:
sapply(dataset, class)


Sepal.Length  'numeric'
Sepal.Width   'numeric'
Petal.Length  'numeric'
Petal.Width   'numeric'
Species       'factor'

In [64]:
# take a peek at the first few rows of the data (head returns 6 by default)
head(dataset)


  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
7          4.6         3.4          1.4         0.3  setosa
8          5.0         3.4          1.5         0.2  setosa

List the levels of the class attribute. With three levels this is a multi-class (multinomial) classification problem; if there were two levels, it would be a binary classification problem.


In [65]:
levels(dataset$Species)


  1. 'setosa'
  2. 'versicolor'
  3. 'virginica'

Class Distribution. A look at the number of instances (rows) that belong to each class, both as an absolute count and as a percentage. Each class has the same number of instances (40, or about 33% of the dataset).


In [66]:
# summarize the class distribution
percentage <- prop.table(table(dataset$Species)) * 100
cbind(freq=table(dataset$Species), percentage=percentage)


           freq percentage
setosa       40   33.33333
versicolor   40   33.33333
virginica    40   33.33333

The statistical summary includes the mean, the minimum and maximum values, as well as some percentiles.


In [67]:
# summarize attribute distributions
summary(dataset)


  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.075   1st Qu.:2.800   1st Qu.:1.575   1st Qu.:0.300  
 Median :5.750   Median :3.000   Median :4.250   Median :1.300  
 Mean   :5.813   Mean   :3.016   Mean   :3.756   Mean   :1.192  
 3rd Qu.:6.400   3rd Qu.:3.225   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.200   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :40  
 versicolor:40  
 virginica :40  
                
                
                

We can see that all of the numerical values have the same scale (centimeters) and similar ranges [0,8] centimeters.
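
As a quick sanity check (a small snippet that is not part of the original notebook), we can print the minimum and maximum of each numeric attribute directly:


In [ ]:
# sketch: confirm that each numeric attribute spans a similar range in centimeters
sapply(dataset[,1:4], range)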

Visualize Dataset

It is helpful with visualization to have a way to refer to just the input attributes and just the output attributes. Let’s set that up and call the input attributes x and the output attribute (or class) y.


In [68]:
# a) Univariate

# split input and output
# columns 1-4 are the input features (x), column 5 is the class label (y)
x <- dataset[,1:4]
y <- dataset[,5]


# columns can also be selected by number, e.g. for a wider data frame:
# df[, c(1:3, 6, 9:12)]

Given that the input variables are numeric, we can create box and whisker plots of each.


In [69]:
# boxplots for numeric
par(mfrow=c(1,4))
for(i in 1:4) {
    boxplot(x[,i], main=names(dataset)[i])
}



In [70]:
# barplot for class breakdown
plot(y)
# this confirms what we learned in the last section: the instances are evenly distributed across the three classes


Multivariate plots look at the interactions between the variables. Let’s look at scatter plots of all pairs of attributes and color the points by class. In addition, because the scatter plots show that the points for each class are generally separate, we can draw ellipses around them.

We can see some clear relationships between the input attributes (trends) and between attributes and the class values (ellipses).


In [71]:
# scatterplot matrix
featurePlot(x=x, y=y, plot="ellipse")


We can also look at box and whisker plots of each input variable again, but this time broken down into separate plots for each class. This can help to tease out obvious linear separations between the classes. This is useful as it shows that there are clearly different distributions of the attributes for each class value.


In [72]:
# box and whisker plots for each attribute
featurePlot(x=x, y=y, plot="box")


Next we can get an idea of the distribution of each attribute, again broken down by class value like the box and whisker plots. Sometimes histograms are good for this, but in this case we will use probability density plots to give nice smooth lines for each distribution.

Like the boxplots, we can see the difference in distribution of each attribute by class value. We can also see the Gaussian-like distribution (bell curve) of each attribute.


In [73]:
# density plots for each attribute by class value
scales <- list(x=list(relation="free"), y=list(relation="free"))
featurePlot(x=x, y=y, plot="density", scales=scales)


Evaluate Algorithms

Now it is time to create some models of the data and estimate their accuracy on unseen data.

Test Harness

Set up the test harness to use 10-fold cross-validation to estimate accuracy. This will split our dataset into 10 parts, train on 9 and test on 1, and repeat for all combinations of train-test splits.

We are using the metric of Accuracy to evaluate models. This is the ratio of the number of correctly predicted instances to the total number of instances in the dataset, multiplied by 100 to give a percentage (e.g. 95% accurate); a tiny worked example follows the next cell. We will use the metric variable when we build and evaluate each model in the next section.


In [74]:
# Run algorithms using 10-fold cross validation
control <- trainControl(method="cv", number=10)
metric <- "Accuracy"
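
As a minimal illustration of how this metric is computed (using made-up labels, not the iris data):


In [ ]:
# sketch: accuracy is the proportion of predictions that match the true labels
actual    <- c("setosa", "setosa", "versicolor", "virginica")
predicted <- c("setosa", "virginica", "versicolor", "virginica")
mean(predicted == actual) * 100   # 3 of 4 correct = 75% accuracy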

Build Models

We don’t know which algorithms would be good on this problem or what configurations to use. We do get an idea from the plots that some of the classes are partially linearly separable in some dimensions, so we are expecting generally good results. Let’s evaluate 5 different algorithms: Linear Discriminant Analysis (LDA), Classification and Regression Trees (CART), k-Nearest Neighbors (KNN), Support Vector Machines (SVM) with a radial kernel, and Random Forest (RF).

This is a good mixture of simple linear (LDA), nonlinear (CART, KNN) and complex nonlinear methods (SVM, RF). We reset the random number seed before each run to ensure that the evaluation of each algorithm is performed using exactly the same data splits. It ensures the results are directly comparable.


In [75]:
# a) linear algorithms
# LDA
set.seed(7)
fit.lda <- train(Species~., data=dataset, method="lda", metric=metric, trControl=control)
# b) nonlinear algorithms
# CART
set.seed(7)
fit.cart <- train(Species~., data=dataset, method="rpart", metric=metric, trControl=control)
# kNN
set.seed(7)
fit.knn <- train(Species~., data=dataset, method="knn", metric=metric, trControl=control)
# c) advanced algorithms
# SVM
set.seed(7)
fit.svm <- train(Species~., data=dataset, method="svmRadial", metric=metric, trControl=control)
# Random Forest
set.seed(7)
fit.rf <- train(Species~., data=dataset, method="rf", metric=metric, trControl=control)

Select Best Model

We now have 5 models and accuracy estimations for each. We need to compare the models to each other and select the most accurate. We can report on the accuracy of each model by first creating a list of the models, gathering resample statistics and using the summary function on the result.


In [76]:
# d) compare algorithms
results <- resamples(list(lda=fit.lda, cart=fit.cart, knn=fit.knn, svm=fit.svm, rf=fit.rf))
summary(results)


Call:
summary.resamples(object = results)

Models: lda, cart, knn, svm, rf 
Number of resamples: 10 

Accuracy 
       Min. 1st Qu. Median   Mean 3rd Qu. Max. NA's
lda  0.9167  0.9375 1.0000 0.9750  1.0000    1    0
cart 0.7500  0.9167 0.9167 0.9000  0.9167    1    0
knn  0.8333  0.9167 1.0000 0.9583  1.0000    1    0
svm  0.7500  0.9167 1.0000 0.9417  1.0000    1    0
rf   0.7500  0.9167 1.0000 0.9417  1.0000    1    0

Kappa 
      Min. 1st Qu. Median   Mean 3rd Qu. Max. NA's
lda  0.875  0.9062  1.000 0.9625   1.000    1    0
cart 0.625  0.8750  0.875 0.8500   0.875    1    0
knn  0.750  0.8750  1.000 0.9375   1.000    1    0
svm  0.625  0.8750  1.000 0.9125   1.000    1    0
rf   0.625  0.8750  1.000 0.9125   1.000    1    0

In the table of results, we can see the distribution of both the Accuracy and Kappa of the models. Let’s just focus on Accuracy for now.

We can also create a plot of the model evaluation results and compare the spread and the mean accuracy of each model. There is a population of accuracy measures for each algorithm because each algorithm was evaluated 10 times (10-fold cross-validation).


In [77]:
dotplot(results)


The results for just the LDA model can be summarized.


In [78]:
# summarize Best Model
# output the estimated accuracy of the model
print(fit.lda)


Linear Discriminant Analysis 

120 samples
  4 predictor
  3 classes: 'setosa', 'versicolor', 'virginica' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 108, 108, 108, 108, 108, 108, ... 
Resampling results:

  Accuracy  Kappa 
  0.975     0.9625

 

This gives a nice summary of what was used to train the model and the mean accuracy achieved, specifically 97.5% +/- 4%, where the +/- 4% is the standard deviation (SD) of accuracy across the 10 cross-validation folds (not shown in the printed output).
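
To see where the +/- 4% comes from, a rough sketch (assuming the fit.lda object trained above) is to pull the per-fold accuracies that caret stores in the train object and compute their standard deviation:


In [ ]:
# sketch: per-fold accuracies live in the resample slot of a caret train object
fold_acc <- fit.lda$resample$Accuracy
mean(fold_acc)   # mean cross-validated accuracy, about 0.975
sd(fold_acc)     # spread across the 10 folds, roughly 0.04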

Make Predictions

Estimate Skill on Validation Dataset

The LDA algorithm created the most accurate model. Now we want to get an idea of the accuracy of the model on our validation set. This will give us an independent final check on the accuracy of the best model. It is valuable to keep a validation set just in case you made a slip during training, such as overfitting to the training set or a data leak. Both will result in an overly optimistic result. We can run the LDA model directly on the validation set and summarize the results in a confusion matrix.


In [79]:
set.seed(7)
predictions <- predict(fit.lda, newdata=validation)
confusionMatrix(predictions, validation$Species)


Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         10          0         0
  versicolor      0         10         0
  virginica       0          0        10

Overall Statistics
                                     
               Accuracy : 1          
                 95% CI : (0.8843, 1)
    No Information Rate : 0.3333     
    P-Value [Acc > NIR] : 4.857e-15  
                                     
                  Kappa : 1          
 Mcnemar's Test P-Value : NA         

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            1.0000           1.0000
Specificity                 1.0000            1.0000           1.0000
Pos Pred Value              1.0000            1.0000           1.0000
Neg Pred Value              1.0000            1.0000           1.0000
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.3333           0.3333
Detection Prevalence        0.3333            0.3333           0.3333
Balanced Accuracy           1.0000            1.0000           1.0000

We can see that the accuracy is 100%. It was a small validation dataset, but this result falls within our expected margin of 97.5% +/- 4%, suggesting we may have an accurate and reliable model.